Abstract

We present a new family of model selection algorithms based on the
resampling heuristics. It can be used in several frameworks, does not
require any knowledge about the unknown law of the data, and may be
seen as a generalization of local Rademacher complexities and V-fold
cross-validation. In the case example of least-square regression on
histograms, we prove oracle inequalities, and that these algorithms are
naturally adaptive to both the smoothness of the regression function
and the variability of the noise level. Then, interpreting V-fold
cross-validation in terms of penalization, we shed light on the
question of choosing V. Finally, a simulation study illustrates the
strength of resampling penalization algorithms against some classical
ones, in particular with heteroscedastic data.

1 Introduction 

Choosing between the outputs of many learning algorithms, from the
prediction viewpoint, amounts to estimating their generalization
abilities. A classical method for this is penalization, which comes
from model selection theory. Basically, it states that a good choice
can be made by minimizing the sum of the empirical risk (how well the
algorithm fits the data) and some complexity measure of the algorithm
(called the penalty). The ideal penalty for prediction is of course the
difference between the true and empirical risks of the output, but it
is unknown in general. It is thus crucial to obtain tight estimates of
such a quantity.

Many penalties or complexity measures have been proposed, both in the
classification and regression frameworks. Consider for instance
regression and least-square estimators on finite-dimensional vector
spaces (the models). When the design is fixed and the noise-level is
constant, equal to $\sigma$, Mallows' $C_p$ penalty [16] (equal to
$2 n^{-1} \sigma^2 D$ for a $D$-dimensional space, and it can be
modified according to the number of models [5], [17]) has some
optimality properties [18, 15, 12]. However, such a penalty linear in
the dimension may be terrible in a heteroscedastic framework (as shown
by (2) and experiment HSd2 in Sect. 6).

In classification, the VC-dimension has the drawback of being indepen- 
dent of the underlying measure, so that it is adapted to the worst case. 
It has been improved with data-dependent complexity estimates, such as 
Rademacher complexities pCS S] (generalized by Fromont with resampling 
ideas [11]), but they may be too large because they are still global complexity
measures. The localization idea then led to local Rademacher complexities 
[U [H] which are tight estimates of the ideal penalty, but involve unknown 
constants and may be very difficult to compute in practice. On the other 
hand, V-fold cross-validation (VFCV) is very popular for such purposes,
but it is still poorly understood from the non-asymptotic viewpoint. 

In this article, we propose a new family of penalties, based on Efron's
bootstrap heuristics [10] (and its generalization to weighted
bootstrap, i.e. resampling). It is a localized version of Fromont's
penalties, which does not involve any unknown constant, and is easy to
compute (at the price of some loss in accuracy) in its V-fold
cross-validation version. We define it in a very general framework, so
that it has a wide range of applications. As a first theoretical step,
we prove the efficiency of these algorithms in the case example of
least-square regression on histograms, under reasonable assumptions.
Indeed, they satisfy oracle inequalities with constant almost one,
asymptotic optimality and adaptivity to the regularity of the
regression function. This comes from explicit computations that allow
us to understand precisely why these penalties work well. Then, we
compare the "classical" VFCV with the V-fold penalties, clarifying how
V should be chosen. Finally, we illustrate these results with a few
simulation experiments. In particular, we show that resampling
penalties are competitive with classical methods for "easy problems",
and may be much better for some harder ones (e.g. with a variable
noise-level).
2 A general model selection algorithm

We consider the following general setting:
$\mathcal{X} \times \mathcal{Y}$ is a measurable space, $P$ an unknown
probability measure on it and
$(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathcal{X} \times \mathcal{Y}$
some data of common law $P$. Let $S$ be the set of predictors
(measurable functions $\mathcal{X} \mapsto \mathcal{Y}$) and
$\gamma : S \times (\mathcal{X} \times \mathcal{Y}) \mapsto \mathbb{R}$
a contrast function. Given a family $(\hat{s}_m)_{m \in \mathcal{M}_n}$
of data-dependent predictors, our goal is to find the one minimizing
the prediction loss $P\gamma(t)$. We will extensively use this
functional notation
$Q\gamma(t) := E_{(X,Y) \sim Q}[\gamma(t, (X,Y))]$, for any probability
measure $Q$ on $\mathcal{X} \times \mathcal{Y}$. Notice that the
expectation here is only taken w.r.t. $(X,Y)$, so that $Q\gamma(t)$ is
random when $t = \hat{s}_m$ is random. Assuming that there exists a
minimizer $s \in S$ of the loss (the Bayes predictor), we will often
consider the excess loss $l(s,t) = P\gamma(t) - P\gamma(s) \geq 0$
instead of the loss.

Assume that each predictor $\hat{s}_m$ may be written as a function
$\hat{s}_m(P_n)$ of the empirical distribution of the data
$P_n = n^{-1} \sum_{i=1}^n \delta_{(X_i, Y_i)}$. The ideal choice for
$m$ is the one which minimizes over $\mathcal{M}_n$ the true prediction
risk
$P\gamma(\hat{s}_m(P_n)) = P_n\gamma(\hat{s}_m(P_n)) + \mathrm{pen}_{\mathrm{id}}(m)$,
where the ideal penalty is equal to

    $\mathrm{pen}_{\mathrm{id}}(m) = (P - P_n)\gamma(\hat{s}_m(P_n))$.

The resampling heuristics (introduced by Efron [10]) states that the
expectation of any functional $F(P, P_n)$ is close to its resampling
counterpart $E_W F(P_n, P_n^W)$, where
$P_n^W = n^{-1} \sum_{i=1}^n W_i \delta_{(X_i, Y_i)}$ is the empirical
distribution $P_n$ weighted by an independent random vector
$W \in [0; +\infty)^n$, with $\sum_i E[W_i] = n$. The expectation
$E_W[\cdot]$ means that we only integrate w.r.t. the weights $W$.

We suggest here to use this heuristics for estimating
$\mathrm{pen}_{\mathrm{id}}(m)$, and to plug the estimate into the
penalized criterion $P_n\gamma(\hat{s}_m) + \mathrm{pen}(m)$. This
defines $\hat{m} \in \mathcal{M}_n$ as follows.

Algorithm 1 (Resampling penalization).

1. Choose a resampling scheme, i.e. the law of a weight vector $W$.

2. Choose a constant
   $C \geq C_W \approx \left( n^{-1} \sum_{i=1}^n E(W_i - 1)^2 \right)^{-1}$.

3. Compute the following resampling penalty for each $m \in \mathcal{M}_n$:

    $\mathrm{pen}(m) = C \, E_W\left[ P_n\gamma(\hat{s}_m(P_n^W)) - P_n^W\gamma(\hat{s}_m(P_n^W)) \right]$.

4. Minimize the penalized empirical criterion to choose $\hat{m}$ and
   thus $\hat{s}_{\hat{m}}$:

    $\hat{m} \in \arg\min_{m \in \mathcal{M}_n} \left\{ P_n\gamma(\hat{s}_m(P_n)) + \mathrm{pen}(m) \right\}$.
Remark 1. 1. There is a constant $C \neq 1$ in front of the penalty,
although there is none in Efron's heuristics, because we did not
normalize $W$. The asymptotic value of the right normalizing constant
$C_W$ may be derived from Theorem 3.6.13 in [21]. In the case example
of histograms, we give a non-asymptotic expression for it (3). In
general, we suggest using some data-driven method to choose $C$ (see
Algorithm 3), while the resampling penalty only estimates the shape of
the ideal one.

2. We allowed $C$ to be larger than $C_W$ because overpenalizing may be
fruitful from a non-asymptotic viewpoint, e.g. when there are few noisy
data.

3. Because of this plug-in method, Algorithm 1 seems to be reasonable
only if $\mathcal{M}_n$ is not too large, i.e. if it has a polynomial
complexity:
$\mathrm{Card}(\mathcal{M}_n) \leq c_{\mathcal{M}} n^{\alpha_{\mathcal{M}}}$.
Otherwise, we can for instance group the models of similar complexities
and reduce $\mathcal{M}_n$ to a polynomial family.

3 The histogram regression case 

As studying Algorithm 1 in general is a rather difficult question, we
focus in this article on the case example of least-square regression on
histograms. Although we do not consider histograms as a final goal,
this first theoretical step will be useful to derive heuristics making
the general Algorithm 1 work.

We first specify the framework and some notations. The data
$(X_i, Y_i) \in \mathcal{X} \times \mathbb{R}$ are i.i.d. of common law
$P$. Denoting by $s$ the regression function, we have

    $Y_i = s(X_i) + \sigma(X_i) \epsilon_i$                          (1)

where $\sigma : \mathcal{X} \mapsto \mathbb{R}$ is the heteroscedastic
noise-level and the $\epsilon_i$ are i.i.d. centered noise terms,
possibly dependent on $X_i$, but with variance 1 conditionally on
$X_i$. The feature space $\mathcal{X}$ is typically a compact set of
$\mathbb{R}^d$. We use the least-square contrast
$\gamma : (t, (x,y)) \mapsto (t(x) - y)^2$ to measure the quality of a
predictor $t : \mathcal{X} \mapsto \mathcal{Y}$. As a consequence, the
Bayes predictor is the regression function $s$, and the excess loss is
$l(s,t) = E_{(X,Y) \sim P} (t(X) - s(X))^2$. To each model $S_m$, we
associate the empirical risk minimizer
$\hat{s}_m = \hat{s}_m(P_n) = \arg\min_{t \in S_m} \{ P_n\gamma(t) \}$
(when it exists and is unique).

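Each model in $(S_m)_{m \in \mathcal{M}_n}$ is the set of piecewise
constant functions (histograms) on some partition
$(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$. It is thus a
vector space of dimension $D_m = \mathrm{Card}(\Lambda_m)$, spanned by
the family $(\mathbf{1}_{I_\lambda})_{\lambda \in \Lambda_m}$. As this
basis is orthogonal in $L^2(\mu)$ for any probability measure $\mu$ on
$\mathcal{X}$, we can make explicit computations that will be useful to
understand Algorithm 1. The following notations will be useful
throughout this article.

    $p_\lambda := P(X \in I_\lambda)$,  $\hat{p}_\lambda := P_n(X \in I_\lambda)$,  $\hat{p}_\lambda^W := \hat{p}_\lambda W_\lambda := P_n^W(X \in I_\lambda)$

    $s_m := \arg\min_{t \in S_m} P\gamma(t) = \sum_{\lambda \in \Lambda_m} \beta_\lambda \mathbf{1}_{I_\lambda}$,  $\beta_\lambda = E_P[Y \mid X \in I_\lambda]$

    $\hat{s}_m := \arg\min_{t \in S_m} P_n\gamma(t) = \sum_{\lambda \in \Lambda_m} \hat{\beta}_\lambda \mathbf{1}_{I_\lambda}$,  $\hat{\beta}_\lambda = \frac{1}{n \hat{p}_\lambda} \sum_{X_i \in I_\lambda} Y_i$

    $\hat{s}_m^W := \arg\min_{t \in S_m} P_n^W\gamma(t) = \sum_{\lambda \in \Lambda_m} \hat{\beta}_\lambda^W \mathbf{1}_{I_\lambda}$,  $\hat{\beta}_\lambda^W = \frac{1}{n \hat{p}_\lambda^W} \sum_{X_i \in I_\lambda} W_i Y_i$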
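For histogram models, the empirical risk minimizer is simply the vector of within-piece averages of the $Y_i$. The following is a minimal Python sketch of that computation, assuming a regular partition of $[0, 1]$ into `n_bins` pieces; the function names `fit_regular_histogram` and `empirical_risk` are illustrative choices and do not come from the original text.

```python
import numpy as np


def fit_regular_histogram(x, y, n_bins):
    """Least-squares histogram estimator on a regular partition of [0, 1].

    Returns (counts, means): counts[k] is the number of x_i in piece k and
    means[k] the average of the corresponding y_i (NaN when the piece is
    empty, i.e. when the minimizer is not uniquely defined).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Index of the piece containing each x_i; x_i == 1.0 goes to the last piece.
    idx = np.minimum((x * n_bins).astype(int), n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    sums = np.bincount(idx, weights=y, minlength=n_bins)
    means = np.full(n_bins, np.nan)
    nonempty = counts > 0
    means[nonempty] = sums[nonempty] / counts[nonempty]
    return counts, means


def empirical_risk(x, y, means):
    """Least-squares empirical risk of the histogram defined by `means`."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    means = np.asarray(means, dtype=float)
    n_bins = len(means)
    idx = np.minimum((x * n_bins).astype(int), n_bins - 1)
    return float(np.mean((means[idx] - y) ** 2))
```

The NaN convention for empty pieces is only one possible way to flag models whose estimator is not unique; such models are discarded before selection in the histogram version of the algorithm.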



Notice that $\hat{s}_m$ is uniquely defined if and only if each
$I_\lambda$ contains at least one of the $X_i$, and the same problem
arises for $\hat{s}_m^W$. This is why we will slightly modify the
general algorithm for histograms. Before this, we compute the ideal
penalty (assuming that $\min_{\lambda \in \Lambda_m} \hat{p}_\lambda > 0$;
otherwise, the model $m$ should clearly not be chosen):

    $\mathrm{pen}_{\mathrm{id}}(m) = (P - P_n)\gamma(\hat{s}_m) = \sum_{\lambda \in \Lambda_m} (p_\lambda + \hat{p}_\lambda) (\hat{\beta}_\lambda - \beta_\lambda)^2 + (P - P_n)\gamma(s_m)$.
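The expression of the ideal penalty can be checked in three lines. The derivation below is a sketch that uses only the notations already introduced, together with the fact that $s_m$ (resp. $\hat{s}_m$) is the least-squares projection on $S_m$ under $P$ (resp. $P_n$), so that the cross terms vanish on each piece $I_\lambda$.

```latex
\begin{align*}
P\gamma(\hat{s}_m) - P\gamma(s_m)
  &= \sum_{\lambda \in \Lambda_m} p_\lambda
     (\hat{\beta}_\lambda - \beta_\lambda)^2 , \\
P_n\gamma(s_m) - P_n\gamma(\hat{s}_m)
  &= \sum_{\lambda \in \Lambda_m} \hat{p}_\lambda
     (\hat{\beta}_\lambda - \beta_\lambda)^2 , \\
(P - P_n)\gamma(\hat{s}_m)
  &= \bigl[ P\gamma(\hat{s}_m) - P\gamma(s_m) \bigr]
   + \bigl[ P_n\gamma(s_m) - P_n\gamma(\hat{s}_m) \bigr]
   + (P - P_n)\gamma(s_m) .
\end{align*}
```

Summing the first two lines and substituting into the third gives the announced decomposition.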

The last term in the sum being centered, it is estimated as zero by the
resampling version of $\mathrm{pen}_{\mathrm{id}}$. The first term is a
sum of $D_m$ terms, each one depending only on the restrictions of $P$
and $P_n$ to $I_\lambda$. Thus, if we assume that $\hat{p}_\lambda > 0$
and if we compute separately all those terms, conditionally on
$\hat{p}_\lambda^W > 0$, we can define the resampling version of
$\mathrm{pen}_{\mathrm{id}}(m)$. This leads to the following algorithm.

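Algorithm 2 (Resampling penalization for histograms).

0. Choose a threshold $A_n \geq 1$ and replace $\mathcal{M}_n$ by

    $\widehat{\mathcal{M}}_n = \left\{ m \in \mathcal{M}_n \text{ s.t. } \min_{\lambda \in \Lambda_m} \{ n \hat{p}_\lambda \} \geq A_n \right\}$.

1. Choose a resampling scheme $\mathcal{L}(W)$.

2. Choose a constant $C \geq C_W(A_n)$ where $C_W$ is defined by (3).

3'. Compute the following resampling penalty for each
    $m \in \widehat{\mathcal{M}}_n$:

    $\mathrm{pen}(m) = C \sum_{\lambda \in \Lambda_m} E_W\left[ (\hat{p}_\lambda + \hat{p}_\lambda^W) (\hat{\beta}_\lambda^W - \hat{\beta}_\lambda)^2 \mid W_\lambda > 0 \right]$.

4'. Minimize the penalized empirical criterion to choose $\hat{m}$ and
    thus $\hat{s}_{\hat{m}}$:

    $\hat{m} \in \arg\min_{m \in \widehat{\mathcal{M}}_n} \left\{ P_n\gamma(\hat{s}_m(P_n)) + \mathrm{pen}(m) \right\}$.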
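When no closed formula is at hand, the penalty of step 3' can be approximated by averaging over simulated weight vectors. The sketch below does this for a single histogram model; it assumes a regular partition of $[0, 1]$, and the argument `draw_weights` is a hypothetical callable `(n, rng) -> W` supplied by the user. It is meant as an illustration of the formula, not as the exact computation used in the paper (which relies on explicit expressions for exchangeable weights).

```python
import numpy as np


def resampling_penalty_histogram(x, y, n_bins, draw_weights, C=1.0,
                                 n_draws=500, seed=0):
    """Monte Carlo estimate of the penalty of step 3' for one histogram model.

    x, y         : data, with x in [0, 1]
    n_bins       : number of pieces of the (assumed regular) partition
    draw_weights : hypothetical callable (n, rng) -> weight vector of length n
    C            : multiplicative constant chosen in step 2
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    idx = np.minimum((x * n_bins).astype(int), n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    if counts.min() == 0:
        raise ValueError("empty piece: the model is removed in step 0")
    p_hat = counts / n
    beta_hat = np.bincount(idx, weights=y, minlength=n_bins) / counts

    total = np.zeros(n_bins)    # running sum of the per-piece terms
    n_valid = np.zeros(n_bins)  # number of draws with W_lambda > 0, per piece
    for _ in range(n_draws):
        w = np.asarray(draw_weights(n, rng), dtype=float)
        w_sum = np.bincount(idx, weights=w, minlength=n_bins)
        ok = w_sum > 0          # conditioning event {W_lambda > 0}
        wy_sum = np.bincount(idx, weights=w * y, minlength=n_bins)
        beta_w = np.where(ok, wy_sum / np.where(ok, w_sum, 1.0), 0.0)
        p_hat_w = w_sum / n     # resampled frequency of each piece
        term = (p_hat + p_hat_w) * (beta_w - beta_hat) ** 2
        total[ok] += term[ok]
        n_valid[ok] += 1
    if n_valid.min() == 0:
        raise ValueError("no draw with positive total weight in some piece")
    return C * float(np.sum(total / n_valid))
```

Step 4' then amounts to adding this value to the empirical risk of each remaining model and taking the minimizer. The number of draws `n_draws` controls the Monte Carlo error and is an arbitrary default here.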

Remark 2. 1. The two modifications of the algorithm for histograms do
not affect the result much if $A_n$ is of the order $\ln(n)$. Indeed,
models with very few data are not relevant in general, and if
$\min_{\lambda \in \Lambda_m} \{ n \hat{p}_\lambda \} \geq A_n$ is not
too small, the event $\{ W_\lambda = 0 \}$ has a very small
probability.

2. We allow $C$ to depend on $A_n$ since the "optimal" constant $C_W$
may depend on it, but this dependence is mild according to our
computations.

When the resampling weights are exchangeable (see definition below), we
are able to compute pen explicitly. It is enlightening to compare it
with $\mathrm{pen}_{\mathrm{id}}$ in expectation, conditionally on
$(\hat{p}_\lambda)_{\lambda \in \Lambda_m}$ (we denote by
$E^{\Lambda_m}[\cdot]$ this conditional expectation):



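    $E^{\Lambda_m}[\mathrm{pen}_{\mathrm{id}}(m)] = \frac{1}{n} \sum_{\lambda \in \Lambda_m} \left( 1 + \frac{p_\lambda}{\hat{p}_\lambda} \right) \left( (\sigma_\lambda^r)^2 + (\sigma_\lambda^d)^2 \right)$

    $E^{\Lambda_m}[\mathrm{pen}(m)] = \frac{C}{n} \sum_{\lambda \in \Lambda_m} \left( R_{1,W}(n, \hat{p}_\lambda) + R_{2,W}(n, \hat{p}_\lambda) \right) \left( (\sigma_\lambda^r)^2 + (\sigma_\lambda^d)^2 \right)$      (2)

with $(\sigma_\lambda^r)^2 := E[\sigma(X)^2 \mid X \in I_\lambda]$ ;
$(\sigma_\lambda^d)^2 := E[(s(X) - s_m(X))^2 \mid X \in I_\lambda]$

and, for $k = 1, 2$ and any $i$ such that $X_i \in I_\lambda$,

    $R_{k,W}(n, \hat{p}_\lambda) = E\left[ \frac{(W_i - W_\lambda)^2}{W_\lambda^{3-k}} \mid W_\lambda > 0 \right]$.

Hence, contrary to Mallows' penalty (with $\sigma^2$ known or
estimated), resampling penalties really take into account the
heteroscedasticity of the noise ($\sigma_\lambda^r$ depends on
$\lambda$) and the bias terms ($\sigma_\lambda^d$). We then define

    $C_W(A_n) := \sup_{n \hat{p}_\lambda \geq A_n} \left\{ \frac{2}{R_{1,W}(n, \hat{p}_\lambda) + R_{2,W}(n, \hat{p}_\lambda)} \right\}$      (3)

and $C'_W(A_n)$ is the infimum of the same quantity.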



Examples of resampling weights 

In this article, we consider resampling weights
$W = (W_1, \ldots, W_n) \in [0; +\infty)^n$ such that $E[W_i] = 1$ for
all $i$ and $E[W_i^2] < \infty$. We mainly consider the following
exchangeable weights (i.e. such that for any permutation $\tau$,
$(W_{\tau(1)}, \ldots, W_{\tau(n)})$ has the same law as
$(W_1, \ldots, W_n)$).

1. Efron($q$): multinomial vector with parameters
   $(q; n^{-1}, \ldots, n^{-1})$, multiplied by $n/q$ so that
   $E[W_i] = 1$. Then,
   $R_{2,W}(n, \hat{p}_\lambda) = (n/q) \times (1 - (n \hat{p}_\lambda)^{-1})$.
   A classical choice is $q = n$.

2. Rademacher: $W_i$ i.i.d., 2 times Bernoulli(1/2). Then,
   $R_{2,W}(n, \hat{p}_\lambda) = 1$.

3. Random hold-out($q$) (or cross-validation):
   $W_i = \frac{n}{q} \mathbf{1}_{i \in I}$ with $I$ a uniform random
   subset (of cardinality $q$) of $\{1, \ldots, n\}$. Then,
   $R_{2,W}(n, \hat{p}_\lambda) = (n/q) - 1$. A classical choice is
   $q = n/2$.

4. Leave-one-out = Random hold-out($n - 1$). Then,
   $R_{2,W}(n, \hat{p}_\lambda) = (n - 1)^{-1}$.

In each case, we can show that
$R_{1,W} = R_{2,W} (1 + \delta^{(W)}_{n, \hat{p}_\lambda})$ for some
explicit small term $\delta^{(W)}_{n, \hat{p}_\lambda}$ (numerically of
the same order as
$E[p_\lambda / \hat{p}_\lambda \mid \hat{p}_\lambda > 0] - 1$ in
expectation for the first three resamplings, and slightly smaller in
the Leave-one-out case). Thus, $C_W \sim C'_W \sim R_{2,W}^{-1}$
(asymptotically in $A_n$).
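The four exchangeable schemes listed above are easy to simulate. The following Python sketch draws one weight vector for each of them with NumPy; the function names are illustrative, and the Efron weights are rescaled by $n/q$ so that every scheme satisfies $E[W_i] = 1$.

```python
import numpy as np


def efron_weights(n, rng, q=None):
    """Efron(q): multinomial counts, rescaled by n/q so that E[W_i] = 1."""
    q = n if q is None else q
    return (n / q) * rng.multinomial(q, np.full(n, 1.0 / n))


def rademacher_weights(n, rng):
    """Rademacher: i.i.d. weights equal to 0 or 2 with probability 1/2."""
    return 2.0 * rng.integers(0, 2, size=n)


def random_holdout_weights(n, rng, q=None):
    """Random hold-out(q): weight n/q on a uniform random subset of size q."""
    q = n // 2 if q is None else q
    w = np.zeros(n)
    w[rng.choice(n, size=q, replace=False)] = n / q
    return w


def leave_one_out_weights(n, rng):
    """Leave-one-out, i.e. Random hold-out(n - 1)."""
    return random_holdout_weights(n, rng, q=n - 1)
```

Each function takes a `numpy.random.Generator` (for instance `np.random.default_rng(0)`) so that the draws are reproducible.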

For computational reasons, it is also convenient to introduce the
following V-fold cross-validation resampling weights: given a partition
$(B_j)_{1 \leq j \leq V}$ of $\{1, \ldots, n\}$ and leave-one-out
weights $W^B \in \mathbb{R}^V$, we define $W_i = W_j^B$ for each
$i \in B_j$. The partition should be taken as regular as possible, and
then we can compute $E[\mathrm{pen}(m)]$ and show that $C_W = V - 1$.
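With V-fold weights there are only $V$ possible weight vectors, so the expectation over $W$ is an exact average over $V$ terms. The sketch below enumerates them; taking the blocks as residue classes of the index modulo $V$ is an assumption made here for simplicity (any partition with block sizes differing by at most one would do).

```python
import numpy as np


def vfold_blocks(n, V):
    """Block label in {0, ..., V-1} of each index; sizes differ by at most 1."""
    return np.arange(n) % V


def vfold_weight_vectors(n, V):
    """All V weight vectors: row j leaves block j out (weight 0) and gives
    weight V / (V - 1) to every other point."""
    blocks = vfold_blocks(n, V)
    W = np.empty((V, n))
    for j in range(V):
        W[j] = np.where(blocks == j, 0.0, V / (V - 1))
    return W
```

Averaging the rows of the returned array gives a vector of ones, which is the normalization $E[W_i] = 1$ required of resampling weights.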

The Rademacher weights lead to penalties close in spirit to local
Rademacher complexities (the link between global Rademacher
complexities and global resampling penalties with Rademacher weights
can be found in [11]). The links with the classical leave-one-out and
VFCV algorithms are given in Sect. 5.

4 Main results 

In this section, we prove that Algorithm 2 has some optimality
properties under the following restrictions, for some non-negative
constants $\alpha_{\mathcal{M}}$, $c_{\mathcal{M}}$, $c_A$,
$c_{\mathrm{rich}}$:

(P1) Polynomial complexity of $\mathcal{M}_n$:
     $\mathrm{Card}(\mathcal{M}_n) \leq c_{\mathcal{M}} n^{\alpha_{\mathcal{M}}}$.

(P2) Richness of $\mathcal{M}_n$:
     $\forall x \in [1, n c_{\mathrm{rich}}^{-1}]$,
     $\exists m \in \mathcal{M}_n$ s.t. $D_m \in [x; c_{\mathrm{rich}} x]$.

(P3) The weights are exchangeable, among the examples given in Sect. 3.

(P4) The threshold is large enough:
     $c_A \ln(n) \geq A_n \geq (26 + 7 \alpha_{\mathcal{M}}) \ln(n)$.

Assumption (P1) is almost necessary, since too large families of models
need larger penalties than polynomial families p3, EJ CLZ]. Assumption
(P2) is necessary but it is always satisfied in practice. Assumption
(P3) is only here to ensure that we have an explicit formula for the
penalty, and sharp bounds on $R_{1,W}$ and $R_{2,W}$. The constant
$(26 + 7 \alpha_{\mathcal{M}})$ in (P4) is quite large due to technical
reasons, but much smaller values (larger than 2) should suffice in
practice.

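Theorem 1. Assume that the $(X_i, Y_i)$'s satisfy the following
assumptions:

(Ab) Bounded data: $\| Y_i \|_\infty \leq A < \infty$.

(An) Noise-level bounded from below:
     $\sigma(X_i) \geq \sigma_{\min} > 0$ a.s.

(Ap) Polynomial decrease of the bias:
     $\exists \beta_1 \geq \beta_2 > 0$, $C_s, c_s > 0$ s.t.
     $c_s D_m^{-\beta_1} \leq l(s, s_m) \leq C_s D_m^{-\beta_2}$.

(Ar) (pseudo-)Regular histograms: $\forall m \in \mathcal{M}_n$,
     $\min_{\lambda \in \Lambda_m} \{ p_\lambda \} \geq c_{\mathrm{reg}} D_m^{-1}$.

Let $\hat{m}$ be the model chosen by Algorithm 2 (under restrictions
(P1-4)), with $\eta' C'_W(A_n) \geq C \geq \eta C_W(A_n)$ for some
$\eta, \eta' > 1/2$. It satisfies, with probability at least
$1 - L_{(A),(P)} n^{-2}$ ($L_{(A),(P)}$ may depend on constants in (A)
and (P), but not on $n$),

    $l(s, \hat{s}_{\hat{m}}) \leq K(\eta, \eta') \inf_{m \in \mathcal{M}_n} \{ l(s, \hat{s}_m) \}$.      (4)

At the price of enlarging $L_{(A),(P)}$, the constant $K(\eta, \eta')$
can be taken close to
$(1 + 2(\eta' - 1)_+)(1 - 2(1 - \eta)_+)^{-1}$, where
$x_+ := \max(x, 0)$. In particular, $K(\eta, \eta')$ is almost 1 if
$\eta$ and $\eta'$ are close to 1.

Moreover, we have the oracle inequality

    $E[l(s, \hat{s}_{\hat{m}})] \leq K(\eta, \eta') E\left[ \inf_{m \in \mathcal{M}_n} \{ l(s, \hat{s}_m) \} \right] + \frac{A^2 L_{(A),(P)}}{n^2}$.      (5)

sketch. By definition of $\hat{m}$,

    $\forall m \in \widehat{\mathcal{M}}_n$,  $(\mathrm{pen} - \mathrm{pen}'_{\mathrm{id}})(\hat{m}) + l(s, \hat{s}_{\hat{m}}) \leq l(s, \hat{s}_m) + (\mathrm{pen} - \mathrm{pen}'_{\mathrm{id}})(m)$

where we replaced $\mathrm{pen}_{\mathrm{id}}$ by
$\mathrm{pen}'_{\mathrm{id}} := \mathrm{pen}_{\mathrm{id}} + (P_n - P)\gamma(s) = (P - P_n)(\gamma(\hat{s}_m) - \gamma(s))$,
which only shifts the criterion by a quantity independent of $m$. In
order to obtain (4) with $\widehat{\mathcal{M}}_n$ instead of
$\mathcal{M}_n$, we show concentration inequalities for
$\mathrm{pen}(m) - \mathrm{pen}'_{\mathrm{id}}(m)$ around zero, with
remainders $\ll l(s, \hat{s}_m)$ if $D_m$ is large (larger than some
power of $\ln(n)$). We use the following steps:

1. explicit computation of $\mathrm{pen}'_{\mathrm{id}}$ and pen when
   $W$ is exchangeable.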
2. accurate bounds on $R_{1,W}$ and $R_{2,W}$, so that
   $(1 - \delta(A_n)) E^{\Lambda_m}[2 p_2(m)] \leq E^{\Lambda_m}[\mathrm{pen}(m)] \leq (1 + \delta(A_n)) E^{\Lambda_m}[2 p_2(m)]$
   with $p_2(m) = P_n(\gamma(s_m) - \gamma(\hat{s}_m))$ and
   $\lim_{x \to \infty} \delta(x) = 0$. This needs sharp bounds on
   $E[Z^{-1} \mid Z > 0]$ with
   $\mathcal{L}(Z) = \mathcal{L}(W_\lambda \mid \hat{p}_\lambda)$, for
   each resampling scheme introduced in Sect. 3.

3. moment inequalities for pen, $p_2$ and
   $p_1(m) = P(\gamma(\hat{s}_m) - \gamma(s_m))$, conditionally on
   $(\hat{p}_\lambda)_{\lambda \in \Lambda_m}$, around their
   conditional expectations. This step uses results from [7], or can be
   derived from [TJ], since all those quantities are U-statistics of
   order 2 (this last fact is not true without the conditioning). This
   implies (unconditional) concentration inequalities.

4. concentration inequality for
   $(P_n - P)(\gamma(s_m) - \gamma(s))$ (Bernstein's inequality
   suffices in the bounded case).

5. since $E^{\Lambda_m}[p_2(m)] = E[p_2(m)]$, it only remains to prove
   that $E^{\Lambda_m}[p_1(m)] \approx E[p_1(m)]$ and $p_2 \sim p_1$
   with high probability. Here we use the Cramér-Chernoff method (it
   can be used since the $(\hat{p}_\lambda)_{\lambda \in \Lambda_m}$
   are negatively associated [9]), together with estimates of the
   exponential moments of the inverse of a binomial random variable.
   Controlling the remainder needs a lower bound on
   $\min_{\lambda \in \Lambda_m} \{ n p_\lambda \}$ that comes from
   (P4) (and Bernstein's inequality).

6. using the assumptions, all the remainders in our concentration
   inequalities are much smaller than $E[l(s, \hat{s}_m)]$ when
   $D_m \geq D(n) = c_1 (\ln(n))^{c_2}$ (with $c_1, c_2$ depending on
   the constants in the assumptions).

Let $m^*$ be a minimizer of $l(s, \hat{s}_m)$ over $\mathcal{M}_n$
(with an infinite loss when $\hat{s}_m$ is not uniquely defined). It
remains to prove that, with large probability, $D_{\hat{m}} \geq D(n)$,
$D_{m^*} \geq D(n)$ and $m^* \in \widehat{\mathcal{M}}_n$. These hold
for $n$ large enough thanks to (Ap) and (Ar) (we did not use (Ar)
before).

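We finally show that (4) implies (5): let $\Omega_n$ be the event of
probability $1 - L_{(A),(P)} n^{-2}$ on which (4) occurs. On
$\Omega_n^c$, $l(s, \hat{s}_{\hat{m}})$ is bounded by $A^2$, so that

    $E[l(s, \hat{s}_{\hat{m}})] = E[l(s, \hat{s}_{\hat{m}}) \mathbf{1}_{\Omega_n}] + E[l(s, \hat{s}_{\hat{m}}) \mathbf{1}_{\Omega_n^c}]$
    $\leq K(\eta, \eta') E\left[ \inf_{m \in \mathcal{M}_n} l(s, \hat{s}_m) \right] + L_{(A),(P)} A^2 n^{-2}$.  □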

Theorem 1 implies the a.s. asymptotic optimality of Algorithm 2 in this
framework. This means that if $s$ and $\sigma(X)$ do not make the model
selection problem too hard, the resampling penalization algorithm
works without any knowledge of the smoothness of $s$, the
heteroscedasticity of $\sigma$ or any property that the unknown law $P$
may satisfy. In that sense, it is a naturally adaptive algorithm.

The lower bound in assumption (Ap) may seem strange, but it is
intuitive that when the bias is decreasing very fast, the optimal model
is of quite small dimension. Then, bounds relying on the fact that this
dimension is large cannot work. The same kind of assumption has already
been used in the density estimation framework for the same reason [20].

Moreover, we can prove that non-constant Hölderian functions satisfy
(Ap) when $X$ has a lower-bounded density w.r.t. the Lebesgue measure
on $\mathcal{X} \subset \mathbb{R}$. The following result states that
resampling penalization is adaptive to the Hölderian smoothness of $s$
in a heteroscedastic framework, since it attains the minimax rate of
convergence $n^{-2\alpha/(2\alpha+1)}$ [19].

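Theorem 2. Let $\mathcal{X}$ be a compact interval of $\mathbb{R}$ and
$\mathcal{Y} \subset \mathbb{R}$. Assume that the $(X_i, Y_i)$ satisfy
(Ab), (An) and the following assumptions:

(Ad) Density bounded from below: $\exists c_{\min} > 0$ ;
     $\forall I \subset \mathcal{X}$,
     $P(X \in I) \geq c_{\min} \mathrm{Leb}(I)$.

(Ah) Hölderian regression function: there exist $\alpha \in (0; 1]$
     and $R > 0$ s.t. $s \in \mathcal{H}(\alpha, R)$, i.e.
     $\forall x_1, x_2 \in \mathcal{X}$,
     $|s(x_1) - s(x_2)| \leq R |x_1 - x_2|^\alpha$.

Let $\mathcal{M}_n$ be the family of regular histograms of dimensions
$1 \leq D \leq n$, $\hat{m}$ the model chosen by Algorithm 2, with
(P3-4) satisfied ($\alpha_{\mathcal{M}} = 0$) and $C$ as in Theorem 1.
Then, denoting $\sigma_{\max} = \sup_{\mathcal{X}} |\sigma| \leq A$,
there are some constants $L_{2,(A),(P)}$ (that may depend on all the
constants in the assumptions) and $L_1(\eta, \eta', \alpha)$ such that

    $E[l(s, \hat{s}_{\hat{m}})] \leq L_1 n^{-2\alpha/(2\alpha+1)} R^{2/(2\alpha+1)} \sigma_{\max}^{4\alpha/(2\alpha+1)} + L_{2,(A),(P)} n^{-2}$.      (6)

Moreover, if $\sigma$ is $K_\sigma$-Lipschitz, the constant
$\sigma_{\max}^2$ may be replaced by
$\int_{\mathcal{X}} \sigma(t)^2 dt$ (at the price of enlarging
$L_{2,(A),(P)}$).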

sketch. 1. Since $\alpha \in (0; 1]$, any non-constant function
   $s \in \mathcal{H}(\alpha, R)$ satisfies (Ap) with
   $\beta_2 = 2\alpha$ and $\beta_1 = 1 + \alpha^{-1}$ (the lower bound
   uses (Ad)).

2. Assumptions (P1), (P2) and (Ar) are automatically satisfied by the
   regular family, so we can use (5). From the proof of Theorem 1 we
   obtain estimates of $E[l(s, \hat{s}_m)]$. Optimizing in $D_m$ gives
   (6) for non-constant functions.

3. When $s$ is constant, a direct proof shows that $D_{\hat{m}}$ is at
   most of order a power of $\ln(n)$ with large probability. This
   ensures that $E[l(s, \hat{s}_{\hat{m}})]$ is at most of order a
   power of $\ln(n)$ times $n^{-1}$, which is
   $\ll n^{-2\alpha/(2\alpha+1)}$ for every $\alpha > 0$.  □

Other results like Theorem 1 may be proved under other assumptions:
unbounded data (with moment inequalities for the noise, regularity
assumptions on $s$ and an upper bound on $\sigma$), $\sigma(x)$ that
can vanish (with the unbounded assumptions, $E[\sigma^2(X)] > 0$ and
some regularity on $\sigma$), etc. We skip their detailed statements in
order to focus on the last two sections, where we take a new look at
V-fold cross-validation (seen from the penalization viewpoint) and
illustrate theoretical results with a simulation study.



5 Links with V-fold cross-validation 

The results of Sect. 4 assume that the weights are exchangeable.
However, computing the resampling penalties exactly with such weights
may take quite long: without a closed formula for pen, $\hat{s}_m^W$
has to be computed for at least $n$ (and up to $2^n$) different weight
vectors. Using the V-fold idea, we defined VFCV weights in Sect. 3 that
allow computing each penalty by considering only $V$ different weight
vectors. We call the resulting algorithm penVFCV.

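It is quite enlightening to compare penVFCV to a more classical version
of VFCV, where the final estimator is $\hat{s}_{\hat{m}}$ with

    $\hat{m} \in \arg\min_{m \in \mathcal{M}_n} \{ \mathrm{crit}_{\mathrm{VFCV}}(m) \} = \arg\min_{m \in \mathcal{M}_n} \left\{ \frac{1}{V} \sum_{j=1}^{V} P_n^{(j)} \gamma(\hat{s}_m^{(-j)}) \right\}$.      (7)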
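For a histogram model, the classical V-fold criterion is computed by fitting the within-piece means on all blocks but one and measuring the squared error on the block left out. The Python sketch below does this for a regular partition of $[0, 1]$; the blocks are taken as residue classes of the index modulo $V$, and the convention of returning infinity when a training piece is empty is an assumption made here to handle non-uniqueness, not the paper's prescription.

```python
import numpy as np


def vfcv_criterion(x, y, n_bins, V):
    """Classical V-fold cross-validation criterion for a regular histogram
    with n_bins pieces on [0, 1]. Returns inf if a training piece is empty."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    idx = np.minimum((x * n_bins).astype(int), n_bins - 1)
    blocks = np.arange(n) % V  # assumed blocking scheme
    risks = []
    for j in range(V):
        train = blocks != j
        counts = np.bincount(idx[train], minlength=n_bins)
        if counts.min() == 0:
            return float("inf")
        means = np.bincount(idx[train], weights=y[train],
                            minlength=n_bins) / counts
        test = ~train
        # Empirical risk on block j of the estimator built without block j.
        risks.append(np.mean((means[idx[test]] - y[test]) ** 2))
    return float(np.mean(risks))
```

Minimizing this quantity over the candidate values of `n_bins` gives the model selected by classical VFCV within the regular family.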

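The superscript $(j)$ (resp. $(-j)$) above means that $P_n$ and
$\hat{s}_m$ are computed with the data belonging to the block $B_j$
(resp. to $B_j^c$). Assuming that the $V$ blocks have the same size
(and forgetting uniqueness issues of $\hat{s}_m^{(-j)}$, that may be
solved as before), we have (for any $j$)

    $E[\mathrm{crit}_{\mathrm{VFCV}}(m)] = P\gamma(s_m) + E[P\gamma(\hat{s}_m^{(-j)}) - P\gamma(s_m)]$      (8)
    $= P\gamma(s_m) + \frac{V}{(V-1) n} \sum_{\lambda \in \Lambda_m} \left( 1 + \delta^{(\mathrm{VFCV})}_{n, p_\lambda} \right) \left( (\sigma_\lambda^r)^2 + (\sigma_\lambda^d)^2 \right)$

where $\delta^{(\mathrm{VFCV})}_{n, p_\lambda}$ is typically small and
non-negative (when $n p_\lambda$ is large enough).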

On the other hand, we can compute exactly the expectation of the
penVFCV criterion (with a constant $C = C_W = V - 1$) when the blocks
have the same size:

    $E[\mathrm{crit}_{\mathrm{penVFCV}}(m)] = P\gamma(s_m) + \frac{1}{n} \sum_{\lambda \in \Lambda_m} \left( 1 + \delta^{(\mathrm{penV})}_{n, p_\lambda} \right) \left( (\sigma_\lambda^r)^2 + (\sigma_\lambda^d)^2 \right)$      (9)

for some typically small non-negative
$\delta^{(\mathrm{penV})}_{n, p_\lambda}$.

Comparing (8) and (9) with (2), one can see that up to small terms,
both criteria are in expectation the sum of the bias and a variance
term. The main difference between them lies in the constant in front of
the variance: it is equal to $C/(V-1) = 1$ for penVFCV, whereas it is
equal to $V/(V-1) > 1$ for VFCV.

The classical V-fold cross-validation is thus "overpenalizing" within a
factor $V/(V-1)$ because it estimates the generalization ability of
$\hat{s}_m^{(-j)}$, which is built upon less data than $\hat{s}_m$.
This gives some clues for the choice of $V$: computational issues (the
smaller $V$, the faster the algorithm), stability of the algorithm
($V = 2$ is known to be quite unstable, and leave-one-out much more
stable), and overpenalization ($V/(V-1)$ should not be too far from 1).
Our analysis does not quantify the stability issue, but it is
sufficient to explain why the asymptotic optimality of leave-$p$-out
needs $p \ll n$ for a prediction purpose [15] and $p \sim n$ for an
identification purpose [22]. Indeed, the overpenalization factor
$n/(n-p) = (1 - p/n)^{-1}$ should go to 1 for optimal prediction and to
infinity for a.s. identification. Moreover, from the non-asymptotic
viewpoint ($n$ small and $\sigma$ large, or $s$ irregular), it is known
that overpenalization (i.e. positively biased penalties) gives better
results. This means that the best $V$ may not always be the largest one
for classical V-fold, independently of computational issues.
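To make the orders of magnitude concrete, the overpenalization factors discussed above can be tabulated directly. The short Python sketch below evaluates $V/(V-1)$ and $n/(n-p)$; the function names are illustrative.

```python
def vfcv_overpenalization(V):
    """Factor V / (V - 1) in front of the variance term for classical V-fold."""
    if V < 2:
        raise ValueError("V must be at least 2")
    return V / (V - 1)


def leave_p_out_overpenalization(n, p):
    """Factor n / (n - p) = (1 - p/n)^(-1) for leave-p-out."""
    if not 0 < p < n:
        raise ValueError("p must satisfy 0 < p < n")
    return n / (n - p)


if __name__ == "__main__":
    for V in (2, 5, 10, 20):
        print(V, round(vfcv_overpenalization(V), 3))
```

For $V = 2, 5, 10, 20$ the factor equals 2, 1.25, about 1.111 and about 1.053 respectively, so only small values of $V$ overpenalize substantially.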

On the contrary, penVFCV is not overpenalizing, unless we explicitly
choose $C > C_W$. We thus do not have to take into account the third
factor for choosing $V$, so that it may be more accurate than VFCV
within a smaller computation time. From the non-asymptotic viewpoint
(or for an identification purpose), it is also easier to overpenalize
when we need to, without destabilizing the algorithm by taking a small
$V$.

A refined analysis of the "negligible" terms such as
$\delta^{(\mathrm{penV})}_{n, p_\lambda}$, compared to the expectation
of $p_\lambda / \hat{p}_\lambda$, explains why the leave-one-out may be
overfitting a little (see the simulations hereafter). We do not detail
this phenomenon since it disappears when $V/(V-1)$ stays away from 1.

6 Simulations 

To illustrate the results of Sect. 4 and the analysis of Sect. 5, we
compare the performances of Algorithm 2 (with several resampling
schemes), Mallows' $C_p$ and VFCV on some simulated data.

Figure 4: s = HeaviSine (see [8])    Figure 5: HSd1    Figure 6: HSd2

We report here four experiments, called S1, S2, HSd1 and HSd2. Data are
generated according to (1) with $X_i$ i.i.d. uniform on
$\mathcal{X} = [0; 1]$ and $\epsilon_i \sim \mathcal{N}(0, 1)$
independent of $X_i$. They differ in the regression function $s$
(smooth for S, see Fig. 1; smooth with jumps for HS, see Fig. 4), the
noise type (homoscedastic for S1 and HSd1, heteroscedastic for S2 and
HSd2) and the number $n$ of data, and are repeated $N = 1000$ times.
Instances of data sets are given in Fig. 2, 3 and 5, 6. Their last
difference lies in the families of models $\mathcal{M}_n$:

S1 regular histograms with 1 < D < pieces.

S2 histograms regular on $[0; 1/2]$ and on $[1/2; 1]$, with $D_1$
   (resp. $D_2$) pieces, $1 \leq D_1, D_2 \leq n/(2 \ln(n))$. The model
   of constant functions is added to $\mathcal{M}_n$.

HSd1 dyadic regular histograms with $2^k$ pieces,
   $0 \leq k \leq \ln_2(n) - 1$.

HSd2 dyadic regular histograms with bin sizes $2^{-k_1}$ and
   $2^{-k_2}$, $0 \leq k_1, k_2 \leq \ln_2(n) - 1$ (dyadic version of
   S2). The model of constant functions is added to $\mathcal{M}_n$.

We compare the following algorithms:

Mal Mallows' $C_p$ penalty: $\mathrm{pen}(m) = 2 \hat{\sigma}^2 D_m n^{-1}$
   where $\hat{\sigma}^2$ is the variance estimator used in [1],
   Sect. 6.

VFCV Classical V-fold cross-validation, defined by (7), with
   $V \in \{2, 5, 10, 20\}$.

penEfr Efron($n$) penalty, $C = C_W = 1$.

penRad Rademacher penalty, $C = C_W = 1$.

penRHO Random hold-out($n/2$) penalty, $C = C_W = 1$.

penLOO Leave-one-out penalty, $C = C_W = n - 1$.

penVFCV V-fold penalty, with $V \in \{2, 5, 10, 20\}$,
   $C = C_W = V - 1$.

For each of these except VFCV, we also consider the same penalties
multiplied by 5/4 (denoted by a + symbol added after its shortened
name). This is intended to test for overpenalization.

In each experiment, for each simulated data set, we first remove the
models with fewer than $A_n = 2$ data points in one piece of their
associated partition. Then, we compute the least-square estimators
$\hat{s}_m$ for each $m \in \mathcal{M}_n$. Finally, we select
$\hat{m} \in \mathcal{M}_n$ using each algorithm and compute its true
excess risk $l(s, \hat{s}_{\hat{m}})$ (and the excess risk of each
model $m \in \mathcal{M}_n$). Since we simulate $N$ data sets, we can
then estimate the two following benchmarks:

    $C_{\mathrm{or}} = \frac{E[l(s, \hat{s}_{\hat{m}})]}{E[\inf_{m \in \mathcal{M}_n} l(s, \hat{s}_m)]}$      $C_{\mathrm{path-or}} = E\left[ \frac{l(s, \hat{s}_{\hat{m}})}{\inf_{m \in \mathcal{M}_n} l(s, \hat{s}_m)} \right]$

Basically, $C_{\mathrm{or}}$ is the constant that should appear in an
oracle inequality like (5), and $C_{\mathrm{path-or}}$ corresponds to a
pathwise oracle inequality like (4). As $C_{\mathrm{or}}$ and
$C_{\mathrm{path-or}}$ give approximately the same rankings between
algorithms, we only report $C_{\mathrm{or}}$ in Tab. 1.
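Both benchmarks are simple Monte Carlo averages over the $N$ simulated data sets. The Python sketch below estimates them from a table of excess losses; the array layout (one row per data set, one column per model) and the function name `oracle_benchmarks` are assumptions made for this illustration.

```python
import numpy as np


def oracle_benchmarks(excess_loss, selected):
    """Estimate C_or and C_path-or from simulated data sets.

    excess_loss : array of shape (N, M); entry [r, m] is the excess loss of
                  model m on simulated data set r
    selected    : length-N integer array; selected[r] is the index of the
                  model chosen by the procedure on data set r
    """
    excess_loss = np.asarray(excess_loss, dtype=float)
    selected = np.asarray(selected, dtype=int)
    chosen = excess_loss[np.arange(len(selected)), selected]
    oracle = excess_loss.min(axis=1)           # best model on each data set
    c_or = chosen.mean() / oracle.mean()       # ratio of expectations
    c_path_or = float(np.mean(chosen / oracle))  # expectation of the ratio
    return float(c_or), c_path_or
```

Both values equal 1 when the procedure always picks the oracle model, and they can differ noticeably when the oracle loss varies a lot across data sets.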

We always observe that penRad and penRHO are competitive with Mal (S1)
and much better for more "difficult" problems (S2 is heteroscedastic;
jumps in HSd1 and HSd2 induce much bias). On the other hand, VFCV is a
little worse than Mal for easy problems (S1) and better for more
difficult ones, but never better than penRad or penRHO.

The best resampling schemes (not taking overpenalization into account)
are penRad and penRHO, in view of S1 and S2 (dyadic models do not
induce many differences between them in HSd1 and HSd2). Then, penLOO is
slightly underpenalizing and penEfr strongly overfits. The comparison
penRad ∼ penRHO > penLOO ≫ penEfr can also be derived from Sect. 3.

In the four experiments, overpenalizing within a factor 5/4 leads to
better results, mainly because $n$ is quite small for the noisy (S1,
S2) or irregular (HSd1, HSd2) signals observed. This is no longer the
case for some larger $n$ or smaller $\sigma$.

We now consider V-fold algorithms. VFCV is slightly better than
penVFCV, but worse than penVFCV+. The influence of $V$ on
$C_{\mathrm{or}}$ confirms the discussion of Sect. 5. For VFCV, the
best $V$ may be $V = 2$ (which overpenalizes, HSd1) or $V = 20$ (which
is more stable, HSd2), or even both (S1, S2). On the contrary, penVFCV
(and penVFCV+) is always improved when $V$ increases, or at least it
does not get worse. Then, the best one is penLOO (or penLOO+), i.e.
$V = n$, the small terms $\delta^{(\mathrm{penV})}_{n, p_\lambda}$
being far less important than stability. This highlights the interest
of defining V-fold penalties, for which it is easier to solve the
complexity-accuracy trade-off.

Remark 3. We only report here the result of 4 experiments, but several other 
ones (with n larger, a smaller, a(x) = l^i.i] or other regression functions s 

such as Doppler, ^ and a regular histogram) give the same kind of results. 
The constants $C_{\mathrm{or}}$ and $C_{\mathrm{path-or}}$ decrease to
1 when $n$ increases and $\sigma$ decreases.

The overpenalization factor 5/4 is generally not optimal, and not even
always better than 1 (in particular when $n$ is large or $\sigma$
small). We have for instance
$C_{\mathrm{or}}$(penLOO) < $C_{\mathrm{or}}$(penRHO) < $C_{\mathrm{or}}$(penRHO+)
in S1 with $\sigma = 0.1$ (with only small differences).

On the tuning parameters 

The above simulations confirm that the best weights (for accuracy) are
Random hold-out($n/2$) and Rademacher, whereas V-fold or leave-one-out
weights may be of interest for computational purposes. The second
tuning parameter, $A_n$, may be taken equal to 2 (its "minimal" value
because terms of the penalty with $n \hat{p}_\lambda = 1$ would be
zero) without serious consequences on $C_{\mathrm{or}}$ in practice.

On the contrary, the constant $C \geq C_W$ is quite important, and the
best ratio $C/C_W$ strongly depends on $n$, $\sigma$, $s$ and
$\mathcal{M}_n$. Moreover, there is no reason for $C_W$ (histograms) to
be the right non-asymptotic constant in the general Algorithm 1. Our
suggestion is to choose $C$ with the so-called "slope heuristics",
proposed by Birgé and Massart [6] for penalties linear in the
dimension. Their claim is that the optimal penalty is twice the minimal
penalty, i.e. the one under which the selected model is obviously too
large. This leads to estimating the shape of
$\mathrm{pen}_{\mathrm{id}}$ by resampling, and the constant $C$ with
the slope heuristics, as follows.
Table 1: Accuracy indexes $C_{\mathrm{or}}$ for each algorithm in four
experiments, ± a rough estimate of uncertainty of the value reported
(i.e. the empirical standard deviation divided by $\sqrt{N}$). In each
column, the more accurate algorithms (taking the uncertainty into
account) are bolded.

Experiment    S1              S2              HSd1               HSd2
s             sin             sin             HeaviSine          HeaviSine
σ(x)          1               x               1                  x
n (data)      200             200             2048               2048
M_n           regular         2 bin sizes     dyadic, regular    dyadic, 2 bin sizes

Mal           1.928 ± 0.04    3.864 ± 0.02    1.606 ± 0.015      1.487 ± 0.011
Mal+          1.800 ± 0.03    4.047 ± 0.02    1.606 ± 0.015      1.487 ± 0.011
2-FCV         2.078 ± 0.04    2.542 ± 0.05    1.002 ± 0.003      1.184 ± 0.004
5-FCV         2.137 ± 0.04    2.582 ± 0.06    1.014 ± 0.003      1.115 ± 0.005
10-FCV        2.097 ± 0.05    2.603 ± 0.06    1.021 ± 0.003      1.109 ± 0.004
20-FCV        2.088 ± 0.04    2.578 ± 0.06    1.029 ± 0.004      1.105 ± 0.004
penEfr        2.597 ± 0.07    3.152 ± 0.07    1.067 ± 0.005      1.114 ± 0.005
penRad        1.973 ± 0.04    2.485 ± 0.06    1.018 ± 0.003      1.102 ± 0.004
penRHO        1.982 ± 0.04    2.502 ± 0.06    1.018 ± 0.003      1.103 ± 0.004
penLOO        2.080 ± 0.05    2.593 ± 0.06    1.034 ± 0.004      1.105 ± 0.004
pen2-FCV      2.578 ± 0.06    3.061 ± 0.07    1.038 ± 0.004      1.103 ± 0.005
pen5-FCV      2.219 ± 0.05    2.750 ± 0.06    1.037 ± 0.004      1.104 ± 0.004
pen10-FCV     2.121 ± 0.05    2.653 ± 0.06    1.034 ± 0.004      1.104 ± 0.004
pen20-FCV     2.085 ± 0.04    2.639 ± 0.06    1.034 ± 0.004      1.105 ± 0.004
penEfr+       2.016 ± 0.05    2.605 ± 0.06    1.011 ± 0.003      1.097 ± 0.004
penRad+       1.799 ± 0.03    2.137 ± 0.05    1.002 ± 0.003      1.095 ± 0.004
penRHO+       1.798 ± 0.03    2.142 ± 0.05    1.002 ± 0.003      1.095 ± 0.004
penLOO+       1.844 ± 0.03    2.215 ± 0.05    1.004 ± 0.003      1.096 ± 0.004
pen2-FCV+     2.175 ± 0.05    2.748 ± 0.06    1.011 ± 0.003      1.106 ± 0.004
pen5-FCV+     1.913 ± 0.03    2.378 ± 0.05    1.006 ± 0.003      1.102 ± 0.004
pen10-FCV+    1.872 ± 0.03    2.285 ± 0.05    1.005 ± 0.003      1.098 ± 0.004
pen20-FCV+    1.898 ± 0.04    2.254 ± 0.05    1.004 ± 0.003      1.098 ± 0.004
Algorithm 3 (Resampling penalization with slope heuristics).

1. Choose a resampling scheme, i.e. the law of a weight vector $W$.

2. Compute the following resampling penalty for each $m \in \mathcal{M}_n$:

    $\mathrm{pen}_0(m) = E_W\left[ P_n\gamma(\hat{s}_m(P_n^W)) - P_n^W\gamma(\hat{s}_m(P_n^W)) \right]$.

3. Compute the selected model $\hat{m}(C)$ as a function of $C > 0$:

    $\hat{m}(C) \in \arg\min_{m \in \mathcal{M}_n} \left\{ P_n\gamma(\hat{s}_m(P_n)) + C \, \mathrm{pen}_0(m) \right\}$.

4. Choose the minimal $C = \hat{C}$ such that $D_{\hat{m}(C)}$ is
   "reasonably small", and take $\hat{m} = \hat{m}(2\hat{C})$.

Step 4 may require artificially introducing huge models in
$\mathcal{M}_n$, all the other ones being considered as "reasonably
small". Finally, notice that $C \mapsto \hat{m}(C)$ is piecewise
constant with at most $\mathrm{Card}(\mathcal{M}_n)$ jumps, so that
steps 3-4 have a complexity $O(\mathrm{Card}(\mathcal{M}_n))$. As a
consequence, the V-fold version of Algorithm 3 can be computed quickly.
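Steps 3 and 4 can be sketched in a few lines of Python. The version below scans a user-supplied grid of constants instead of following the exact piecewise-constant path, which is a simplification made here; the threshold `max_dim` that encodes "reasonably small" and all function and argument names are likewise assumptions of this sketch.

```python
import numpy as np


def select_model(emp_risk, pen_shape, C):
    """Index of the model minimizing empirical risk + C * penalty shape."""
    return int(np.argmin(np.asarray(emp_risk) + C * np.asarray(pen_shape)))


def slope_heuristics(emp_risk, pen_shape, dims, max_dim, C_grid):
    """Return (C_min, m_hat): C_min is the smallest constant of the grid for
    which the selected dimension is at most max_dim, and m_hat is the index
    of the model selected with the constant 2 * C_min."""
    emp_risk = np.asarray(emp_risk, dtype=float)
    pen_shape = np.asarray(pen_shape, dtype=float)
    dims = np.asarray(dims)
    for C in np.sort(np.asarray(C_grid, dtype=float)):
        if dims[select_model(emp_risk, pen_shape, C)] <= max_dim:
            C_min = float(C)
            return C_min, select_model(emp_risk, pen_shape, 2.0 * C_min)
    raise ValueError("no constant in the grid selects a reasonably small model")
```

A grid search costs one pass over the models per grid point; following the breakpoints of the path exactly, as suggested in the text, avoids the dependence on the grid resolution.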